METER: MEasuring TExt Reuse
Authors
Abstract
In this paper we present results from the METER (MEasuring TExt Reuse) project, whose aim is to explore issues pertaining to text reuse and derivation, especially in the context of newspapers using newswire sources. Although the reuse of text by journalists has been studied in linguistics, we are not aware of any investigation that applies existing computational methods to this particular task. We investigate the classification of newspaper articles according to their degree of dependence upon, or derivation from, a newswire source, using a simple 3-level scheme designed by journalists. Three approaches to measuring text similarity are considered: n-gram overlap, Greedy String Tiling, and sentence alignment. Measured against a manually annotated corpus of source and derived news text, a combined classifier with automatically selected features performs best overall on the ternary classification, achieving an average F1-measure of 0.664 across all three categories.
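As an illustration of the first of the three similarity measures named in the abstract, the following is a minimal sketch of a word n-gram overlap score between a newspaper text and a newswire source. It is not the project's implementation: the set-based containment formulation, the lowercase whitespace tokenisation, the default n = 3, and the function names `word_ngrams` and `ngram_containment` are all assumptions made for this example.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in `text`.

    Assumption: tokens are lowercased and split on whitespace; a real
    system would likely use a proper tokeniser.
    """
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(derived: str, source: str, n: int = 3) -> float:
    """Fraction of the derived text's distinct n-grams also found in the source.

    1.0 means every n-gram of the derived text occurs in the source;
    0.0 means none do (or the derived text is shorter than n tokens).
    """
    derived_ngrams = word_ngrams(derived, n)
    if not derived_ngrams:
        return 0.0
    source_ngrams = word_ngrams(source, n)
    return len(derived_ngrams & source_ngrams) / len(derived_ngrams)


if __name__ == "__main__":
    source = "the prime minister said on monday that taxes would fall"
    derived = "the prime minister said taxes would fall"
    print(ngram_containment(derived, source, n=2))
```

The score is asymmetric by design: it is normalised by the derived text, so it asks how much of the newspaper article can be found in the source rather than the reverse. A score of this kind could serve as one feature for a classifier over the three derivation categories.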
Similar resources
Using the XARA XML-Aware Corpus Query Tool to Investigate the METER Corpus
The METER (MEasuring TExt Reuse) corpus is a corpus designed to support the study and analysis of journalistic text reuse. It consists of a set of news stories written by the Press Association (PA), the major UK news agency, and a set of stories about the same news events, as published in various British newspapers, some of which were derived from the PA version and some of which were written i...
The METER Corpus: A corpus for analysing journalistic text reuse
As a part of the METER (MEasuring TExt Reuse) project we have built a new type of comparable corpus consisting of annotated examples of related newspaper texts. Texts in the corpus were manually collected from two main sources: the British Press Association (PA) and nine British national newspapers that subscribe to the PA newswire service. In addition to being structured to support efficient s...
Postgraduate Transfer Report.PDF
This thesis builds upon our current understanding of text reuse by proposing a hypothetical framework of text reuse and applying this abstract definition to a specific domain, that of journalistic reuse. The framework aims to explore a suitable measure of reuse and determine suitable discriminators for document derivation. Although text can be reused verbatim (word-for-word), in most cases, tex...
Building and annotating a corpus for the study of journalistic text reuse
In this paper we present the METER Corpus, a novel resource for the study and analysis of journalistic text reuse. The corpus consists of a set of news stories written by the Press Association (PA), the major UK news agency, and a set of stories about the same news events, as published in various British newspapers. In some cases the newspaper stories are rewritten from the PA source; in other ...
Measuring Text Reuse in a Journalistic Domain
This paper describes a general framework for measuring text reuse. This term is used to describe how content from one or more known sources can be reused either verbatim (word-for-word copy) or otherwise rewritten depending upon factors influencing the creation of a new document. These may include reduction/increase in length, change of style, simplification of content, shif...